Author: Vladimir Koltunov
London is one of the largest and busiest cities in Europe. It is also one of the world's capitals of finance, culture, art and entertainment. With all of its cultural diversity it provides many opportunities for new businesses and working professionals. The Greater London Authority publishes population projections for the city, updated annually at both local authority and ward level, for years up to 2050.
In the modern world of cosmopolitan cities and fast-changing trends it becomes increasingly important to examine and understand all these changes quantitatively. City governments and urban planners, entrepreneurs and investors - all have an interest in identifying opportunities early and in growing their urban footprint in prospective districts.
My idea here is that, using population projection data and current venue information (from the extensive Foursquare database), I can do the following analysis:
For this analysis I will use the following sources of data:
First of all, let's download the boundaries of the wards published by the Greater London Authority on the London Datastore.
It contains "National Statistics data © Crown copyright and database right [2015]" and "Contains Ordnance Survey data © Crown copyright and database right [2015]".
wards.plot()
For the analysis we will need the centroid location of every ward. We will save these coordinates to the GeoDataFrame itself.
wards['centroid_lat'] = wards.geometry.centroid.y # centroid latitude
wards['centroid_lon'] = wards.geometry.centroid.x # centroid longitude
wards.head()
In our analysis we will focus only on the central part of Greater London, known as Inner London. We will take the list of boroughs that belong to Inner London from the London Borough Profiles dataset.
inner_wards.shape
inner_wards.plot()
radiuses = np.sqrt(inner_wards.HECTARES * 10000 / math.pi) # radius each ward would have if it were circular
print("Radius spread is [{}, {}] with mean={} and median={}".format(radiuses.min(), radiuses.max(), radiuses.mean(), radiuses.median()))
radiuses[:5]
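As a quick sanity check on that formula, a ward of 100 hectares corresponds to an equivalent circle of roughly 564 metres:

```python
import math

# A ward of 100 hectares = 1,000,000 m^2; if it were a perfect circle,
# its radius would be sqrt(area / pi)
hectares = 100
radius_m = math.sqrt(hectares * 10000 / math.pi)
# roughly 564 m, comfortably inside the 750 m search radius chosen below
```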
A radius of 750 meters will be more than enough for the search.
import time
import random

def fill_venues():
    # iterate over every ward and every Foursquare category,
    # filling in the counts that are still missing
    for i, row in inner_wards_venues.iterrows():
        for category in categories_list:
            if row[category[0]] is None:
                try:
                    inner_wards_venues.loc[i, category[0]] = venues_count(
                        row.centroid_lat, row.centroid_lon, category[1])
                    # throttle requests to stay within API rate limits
                    time.sleep(0.2 + 0.2 * random.random())
                except ValueError as err:
                    print('Exception at row %i for category=%s. Error: %s'
                          % (i, category[0], err))
                    # abort so the cell can be re-run and resume from the gap
                    return
fill_venues()
inner_wards_venues.head()
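The `venues_count` helper called above is not defined in this notebook. A minimal sketch of what it might look like against the Foursquare v2 venue-search endpoint (the credential placeholders, API version date, limit and response handling are assumptions, not the author's actual code):

```python
import json
import urllib.parse
import urllib.request

FOURSQUARE_URL = "https://api.foursquare.com/v2/venues/search"

def build_search_url(lat, lon, category_id, radius=750,
                     client_id="CLIENT_ID", client_secret="CLIENT_SECRET"):
    """Build a Foursquare v2 venue-search URL for one ward centroid."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": "20190601",           # API version date (assumed)
        "ll": f"{lat},{lon}",      # centroid coordinates
        "radius": radius,          # search radius in metres
        "categoryId": category_id, # Foursquare category to count
        "limit": 50,
    }
    return FOURSQUARE_URL + "?" + urllib.parse.urlencode(params)

def venues_count(lat, lon, category_id):
    """Return the number of venues of the given category near (lat, lon)."""
    with urllib.request.urlopen(build_search_url(lat, lon, category_id)) as resp:
        payload = json.load(resp)
    return len(payload["response"]["venues"])
```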
In this project we will try to find clusters of similar areas across Inner London based on the frequencies of the different categories of venues present in those areas.
The first step was to collect the needed data: the list of London areas (wards), their boundaries and central points. We also gathered venue information for these areas using the public Foursquare API.
The next step will be the exploratory analysis and the clustering itself. We will normalize the number of venues across areas, and then apply the K-means clustering algorithm to find groups of similar wards based on their venue "background". We will try to find clusters with different levels of development, which can help us identify highly developed and underdeveloped (or opportunistic) areas in the city.
In the final step we will look at the population projections for the different areas and clusters. The main goal here is to understand whether or not the urban footprint will grow.
First of all, we will normalize all the data to a scale from 0 to 1.
from sklearn.preprocessing import MinMaxScaler

# the first 4 columns hold ward metadata; scale only the venue-count columns
scaled_wards = MinMaxScaler().fit_transform(inner_wards_venues.values[:, 4:])
wards_venues_scaled = pd.DataFrame(scaled_wards, columns=inner_wards_venues.columns[4:].values)
wards_venues_scaled.head()
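As a quick illustration of what `MinMaxScaler` does, each column of a toy venue-count matrix is mapped independently onto [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy venue counts for three wards and two categories
counts = np.array([[2.0, 10.0],
                   [4.0, 40.0],
                   [6.0, 25.0]])

scaled = MinMaxScaler().fit_transform(counts)
# Each column is scaled independently:
# column 0: 2 -> 0.0, 4 -> 0.5, 6 -> 1.0
# column 1: 10 -> 0.0, 40 -> 1.0, 25 -> 0.5
```

This keeps a ward's counts comparable across categories with very different absolute scales (e.g. cafés vs. universities).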
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(20, 10))
sns.boxplot(data=wards_venues_scaled)
plt.tick_params(labelsize=15)
plt.xlabel('Category', fontsize=20)
plt.ylabel('Relative number of venues', fontsize=20)
plt.xticks(rotation=60)
plt.show()
For clustering we will use K-means. After experimenting with different numbers of clusters I decided to stop at 5, as this number gives enough diversity between clusters and good interpretability. With a higher number of clusters it becomes very hard to describe each cluster, as the differences between them shrink.
These are examples of box plots with different number of clusters:
| 2 clusters | 3 clusters |
|---|---|
| ![]() | ![]() |

| 4 clusters | 7 clusters |
|---|---|
| ![]() | ![]() |
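A more systematic way to compare candidate numbers of clusters (not used above, but a useful cross-check) is the silhouette score; a sketch on synthetic data standing in for the real ward matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Synthetic stand-in for the scaled ward/venue matrix: two clear blobs
X = np.vstack([rng.normal(0.2, 0.05, (50, 3)),
               rng.normal(0.8, 0.05, (50, 3))])

# Higher silhouette score = tighter, better-separated clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On the real data the score is only a guide; interpretability of the resulting box plots still matters, which is how 5 was chosen here.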
from sklearn.cluster import KMeans

# set number of clusters
nclusters = 5
# run k-means clustering
kmeans = KMeans(n_clusters=nclusters, n_init=20).fit(wards_venues_scaled)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[:10]
# add clustering labels as the first column
try:
    wards_venues_scaled.insert(0, 'Cluster', kmeans.labels_)
except ValueError:
    # 'Cluster' already exists from a previous run: just overwrite it
    wards_venues_scaled['Cluster'] = kmeans.labels_
wards_venues_scaled.head()
fig, ax = plt.subplots(1, nclusters, figsize=(20, 5), sharey=True)
ax[0].set_ylabel('Count of venues (relative)', fontsize=25)
for k in range(nclusters):
    # same y-axis limits on every cluster panel
    ax[k].set_ylim(0, 1.1)
    ax[k].xaxis.set_label_position('top')
    ax[k].set_xlabel('Cluster ' + str(k + 1), fontsize=25)
    ax[k].tick_params(labelsize=20)
    plt.sca(ax[k])
    plt.xticks(rotation=90)
    sns.boxplot(data=wards_venues_scaled[wards_venues_scaled['Cluster'] == k].drop(columns='Cluster'), ax=ax[k])
plt.show()
wards_clustered.Cluster.value_counts()
# create map
map_clusters = folium.Map(location=[london_lat, london_lon], zoom_start=11)

# set a colour for each cluster from the 'Set1' palette
palette = cm.get_cmap('Set1')
rainbow = [colors.rgb2hex(c) for c in palette(np.linspace(0, 1, nclusters))]

# add a marker for every ward, coloured by its cluster
for lat, lon, ward, cluster in zip(wards_clustered['centroid_lat'],
                                   wards_clustered['centroid_lon'],
                                   wards_clustered['NAME'],
                                   wards_clustered['Cluster']):
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=("%i : %s" % (cluster + 1, ward)),
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
So we have 5 main clusters on our map:

- Red (1) cluster - moderate scores, with residential and transport venues the most popular. These are highly developed residential suburbs.
- Green (2) cluster - underdeveloped, with low frequencies across all venue categories.
- Orange (3) cluster - the business or downtown cluster. A highly developed business part of the city with very high frequencies for all venue categories except residential. This is the part of the city where people mostly work and are entertained, but do not live.
- Brown (4) cluster - highly developed, with high frequencies, fewer residential places and more professional ones. Developed professional and industrial areas.
- Grey (5) cluster - mid-developed, with moderate frequencies in almost all categories and a slight prevalence of professional places.
geo = json.load(project.get_file('geo_Inner_London_Wards.geojson'))
m = folium.Map(location=[london_lat, london_lon], zoom_start=11)
folium.Choropleth(
geo_data=geo,
name='choropleth',
data=wards_clustered,
columns=['NAME', 'Cluster'],
key_on='properties.NAME',
fill_color='Set1',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Cluster Number'
).add_to(m)
for lat, lon, ward, cluster in zip(wards_clustered['centroid_lat'],
                                   wards_clustered['centroid_lon'],
                                   wards_clustered['NAME'],
                                   wards_clustered['Cluster']):
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        popup=("%i : %s" % (cluster + 1, ward)),
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.3).add_to(m)
folium.LayerControl().add_to(m)
m
Now let's take a look at the population projections for these wards, and at how much each ward will grow over the next 20 years.
wards_population = wards_population.pivot(index='Code', columns='Year', values='Population').reset_index()
wards_population.head()
wards_population = wards_population.rename_axis(None, axis=1)  # drop the leftover 'Year' columns label
wards_population.head()
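On toy data, this pivot reshapes the long Year/Population table into one row per ward, after which the relative growth used in the next step falls out directly (ward codes and values here are illustrative):

```python
import pandas as pd

# Toy long-format projection table: one row per ward per year
long_df = pd.DataFrame({
    "Code": ["W1", "W1", "W2", "W2"],
    "Year": ["2020", "2040", "2020", "2040"],
    "Population": [1000, 1200, 2000, 1900],
})

# One row per ward, one column per year
wide = long_df.pivot(index="Code", columns="Year", values="Population").reset_index()
wide = wide.rename_axis(None, axis=1)  # drop the leftover 'Year' columns label

# Relative growth over the projection horizon
wide["growth"] = (wide["2040"] - wide["2020"]) / wide["2020"]
```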
wards_projection = wards_clustered.merge(wards_population, how='left', left_on="WARD_CODE", right_on="Code")
wards_projection.drop('Code', axis=1, inplace=True)
wards_projection.head()
wards_projection['population_growth'] = (wards_projection['2040'] - wards_projection['2020'])/wards_projection['2020']
wards_projection.head()
Let's take a look at the average and median growth within each cluster:
wards_projection.groupby('Cluster').mean()['population_growth']
wards_projection.groupby('Cluster').median()['population_growth']
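On toy data, this per-cluster aggregation behaves as follows; comparing mean and median shows why both are worth reporting, since a single fast-growing ward can pull the mean up (cluster labels and values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 1],
    "population_growth": [0.10, 0.30, 0.05, 0.07, 0.30],
})

mean_growth = toy.groupby("Cluster")["population_growth"].mean()
median_growth = toy.groupby("Cluster")["population_growth"].median()
# cluster 1: one outlier ward lifts the mean well above the median
```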
# create map
map_clusters_growth = folium.Map(location=[london_lat, london_lon], zoom_start=11)

# reuse the 'Set1' palette for cluster colours
palette = cm.get_cmap('Set1')
rainbow = [colors.rgb2hex(c) for c in palette(np.linspace(0, 1, nclusters))]

# draw a blue circle per ward whose radius encodes projected growth
for lat, lon, ward, cluster, growth in zip(wards_projection['centroid_lat'],
                                           wards_projection['centroid_lon'],
                                           wards_projection['NAME'],
                                           wards_projection['Cluster'],
                                           wards_projection['population_growth'].fillna(0)):
    folium.Circle([lat, lon],
                  radius=100 + 500 * growth,
                  color='#00000000',
                  fill=True,
                  fill_color='blue',
                  fill_opacity=0.5).add_to(map_clusters_growth)

# overlay small cluster-coloured markers at the ward centroids
for lat, lon, ward, cluster in zip(wards_projection['centroid_lat'],
                                   wards_projection['centroid_lon'],
                                   wards_projection['NAME'],
                                   wards_projection['Cluster']):
    folium.CircleMarker(
        [lat, lon],
        radius=2,
        popup=("%i : %s" % (cluster + 1, ward)),
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.9).add_to(map_clusters_growth)

map_clusters_growth
map_clusters_growth
We can see that many areas of the green (2) cluster, which we identified as the least developed, show significant population growth. There are also a couple of brown (4) cluster areas that stand out.
cluster_of_interest = 1
total_2020 = wards_projection['2020'].sum()
total_2040 = wards_projection['2040'].sum()
population_clustered = wards_projection.groupby('Cluster')[['2020', '2040']].sum()
cluster_2020 = population_clustered.loc[cluster_of_interest].iloc[0]
cluster_2040 = population_clustered.loc[cluster_of_interest].iloc[1]
bar1 = [(total_2020 - cluster_2020), (total_2040 - cluster_2040)]
bar2 = [cluster_2020, cluster_2040]
r=[0,2]
names=['2020','2040']
width=1
plt.bar(r, bar1, color='blue', width=width)
plt.bar(r, bar2, bottom=bar1, color='red', width=width)
plt.xticks(r, names)
plt.xlabel("Year")
plt.legend(['Other Areas','Areas of interest'])
plt.show()
share2020= round((cluster_2020/total_2020)*100,2)
share2040= round((cluster_2040/total_2040)*100,2)
print('{} wards in the less developed cluster host {}% of the city\'s population in 2020 and will host {}% of the total population in 2040'.format(
    wards_projection.groupby('Cluster').size()[cluster_of_interest],
    share2020,
    share2040))
As we can see, our biggest-growing cluster won't significantly change its share of the population in the future. But everything can happen after our project :)
Our analysis shows that there are distinct areas in Inner London which can be grouped together as similar. Judging by the presence of different venue categories, there is a large number of underdeveloped areas, which can help identify places for new businesses. The population projection data shows that there will not be huge population growth in any of the identified clusters. Nevertheless, the most "opportunistic" areas will still hold more than half of the city's population.
Identifying currently underdeveloped areas can give a big advantage to early businesses and service providers there.
Our main aim was to identify less developed and underdeveloped areas in London, and I believe we successfully achieved this goal.
Despite these results, it would be interesting to extend the clustering analysis with other socio-economic measures such as demographics, religion, housing types, employment rates, income per household and many more. Fortunately, the London Datastore has a huge amount of open data for further analysis and for gaining deeper insight into the future of London's areas.